According to the American Centers for Disease Control and Prevention (CDC), stroke can happen at any age: "Every year, more than 795,000 people in the United States have a stroke." The good news is that people can protect themselves by understanding and controlling the risk factors for stroke [1]. Mayo Clinic defines stroke this way: "A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes." It points out that some potentially treatable stroke risk factors are obesity, lack of physical activity, high blood pressure, cigarette smoking, high cholesterol, diabetes, and cardiovascular disease [2]. It is therefore of great interest to analyze the available stroke prediction dataset and find out the role each attribute plays in stroke risk prediction.
The analysis for lab 1 is meaningful, and the data will speak for itself. Findings and conclusions from the analysis will have the following benefits: 1) We will gain first-hand statistics and visualizations about stroke risk factors. 2) Findings are expected to provide solid evidence for stroke risk prediction for physicians and healthcare professionals, who stay on the front line of patient care and should have access to accurate information for stroke prevention to minimize misdiagnosis. 3) Discoveries can also benefit fitness professionals. For example, fitness trainers can make recommendations and create training programs tailored to clients to help lower the risk of stroke. 4) Conclusions and prediction models can be incorporated into electronic gadgets such as a Fitbit or Apple Watch to help monitor people's stroke risk factors and alert people when the risk is high.
Stroke misdiagnosis is a major healthcare concern, with initial misdiagnosis estimated to occur in 9% of all stroke patient cases in the emergency setting [4]. Each year about 1.2 million people in the US may have a stroke or are at high risk of an impending stroke. If the misdiagnosis rate could be reduced by just 1 percentage point, approximately 12,000 additional patients could be diagnosed correctly each year.
The high probability of incorrect diagnoses encourages us to minimize false positive and false negative stroke diagnoses in the prediction model. Because the model will be used to predict rather than diagnose, false positive cases can be tolerated much more than false negative ones. In other words, if a patient has the potential for a stroke, we prefer to alert the physicians and the patients for further examination.
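This preference can be made concrete at model-evaluation time: a recall-oriented metric such as the F-beta score with beta > 1 penalizes false negatives more heavily than false positives. The sketch below uses hypothetical labels and predictions (not from this lab's model) purely to illustrate the idea:

```python
# Hypothetical labels/predictions (NOT from this lab's model), used only
# to illustrate recall-oriented evaluation for a stroke screening setting.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed strokes
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms

recall = tp / (tp + fn)      # fraction of true strokes caught
precision = tp / (tp + fp)   # fraction of alerts that were real

# F-beta with beta > 1 weights recall above precision; beta = 2 is a
# common choice when false negatives are the expensive error.
beta = 2
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(f"recall={recall:.3f}, precision={precision:.3f}, F{beta}={f_beta:.3f}")
```

With beta = 2, recall counts roughly four times as much as precision, which matches the stated tolerance for false alarms over missed strokes.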
The Stroke Prediction dataset downloaded from Kaggle was chosen for lab 1. This dataset includes patients' information that may be helpful for predicting if a patient is at risk of stroke.
The dataset can be accessed from here:
Dataset Author: fedesoriano
The list below describes the attributes in the dataset [3]:
# import modules needed for analysis
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import missingno as mn
dataset_url = 'https://raw.githubusercontent.com/Hessam64/data/main/healthcare-dataset-stroke-data.csv'
df = pd.read_csv(dataset_url)
df.head()
| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
df.tail()
| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | NaN | never smoked | 0 |
| 5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | 0 |
| 5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| 5108 | 37544 | Male | 51.0 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
| 5109 | 44679 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |
Looking at the first and last five rows of the raw dataset, we see that it has 12 attributes. The 'id' attribute is an identification number for each subject; it is not essential for stroke prediction and will be removed later. The attributes named 'gender', 'ever_married', 'work_type', 'Residence_type' and 'smoking_status' are categorical with strings as categories. The attributes named 'hypertension', 'heart_disease' and 'stroke' are categorical with integers as categories. The attributes named 'age', 'avg_glucose_level' and 'bmi' are continuous variables. Moreover, the 'stroke' attribute should be the response variable.
print(df.info())
print(f"Total number of observations in the dataset is {len(df)}.")
print(f"The proportion of missing values for the 'bmi' attribute is {df['bmi'].isna().mean():.4f}.")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
None
Total number of observations in the dataset is 5110.
The proportion of missing values for the 'bmi' attribute is 0.0393.
df.shape
(5110, 12)
From this brief description of the data, we see that the dataset has 5110 rows and 12 columns. Only the column named 'bmi' has missing values. Because the proportion of missing values for the 'bmi' attribute is only 0.0393, we consider imputation a valid option. Please see the details of imputation in later sections.
Another way to check missing values is shown below. We see that the 'bmi' attribute has 201 missing values; these values may not have been collected, or the required equipment may not have been available. No other attribute has missing values.
# show all missing values
total_missing_values = df.isnull().sum().sort_values(ascending = False)
total_missing_values
bmi                  201
stroke                 0
smoking_status         0
avg_glucose_level      0
Residence_type         0
work_type              0
ever_married           0
heart_disease          0
hypertension           0
age                    0
gender                 0
id                     0
dtype: int64
We can also visualize the number of missing values in each attribute.
mn.matrix(df.sort_values(by=["bmi"]),
figsize=(10,5), fontsize=12,
color=(0.1, 0.2, 1))
<AxesSubplot:>
Each blue bar represents an attribute in the dataset, and a white gap within a bar indicates missing values. Clearly, 'bmi' has missing values while all other attributes do not.
Each field has an acceptable range of values or set of categories. Next, we investigate whether any abnormal data exists in the dataset.
# Find abnormal values in the dataset
def validate_data(df):
    valid_attr = np.array([])
    valid_attr = np.append(valid_attr, {'gender': {'Male', 'Female'} == set(df['gender'].unique())})
    valid_attr = np.append(valid_attr, {'age': np.logical_and(df['age'].unique() > 0, df['age'].unique() < 120).all()})
    valid_attr = np.append(valid_attr, {'hypertension': {0, 1} == set(df['hypertension'].unique())})
    valid_attr = np.append(valid_attr, {'heart_disease': {0, 1} == set(df['heart_disease'].unique())})
    valid_attr = np.append(valid_attr, {'ever_married': {'Yes', 'No'} == set(df['ever_married'].unique())})
    valid_attr = np.append(valid_attr, {'work_type': {'Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'} == set(df['work_type'].unique())})
    valid_attr = np.append(valid_attr, {'Residence_type': {'Urban', 'Rural'} == set(df['Residence_type'].unique())})
    valid_attr = np.append(valid_attr, {'avg_glucose_level': np.logical_and(df['avg_glucose_level'].unique() > 0, df['avg_glucose_level'].unique() < 1000).all()})
    valid_attr = np.append(valid_attr, {'bmi': np.logical_and(df['bmi'].unique() > 0, df['bmi'].unique() < 100).all()})
    valid_attr = np.append(valid_attr, {'smoking_status': {'formerly smoked', 'never smoked', 'smokes', 'Unknown'} == set(df['smoking_status'].unique())})
    valid_attr = np.append(valid_attr, {'stroke': {0, 1} == set(df['stroke'].unique())})
    return valid_attr

print(validate_data(df))
print(df.info())
print("Number of duplicated rows: {}".format(len(df) - len(df.drop_duplicates())))
[{'gender': False} {'age': True} {'hypertension': True}
{'heart_disease': True} {'ever_married': True} {'work_type': True}
{'Residence_type': True} {'avg_glucose_level': True} {'bmi': False}
{'smoking_status': True} {'stroke': True}]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5110 non-null int64
1 gender 5110 non-null object
2 age 5110 non-null float64
3 hypertension 5110 non-null int64
4 heart_disease 5110 non-null int64
5 ever_married 5110 non-null object
6 work_type 5110 non-null object
7 Residence_type 5110 non-null object
8 avg_glucose_level 5110 non-null float64
9 bmi 4909 non-null float64
10 smoking_status 5110 non-null object
11 stroke 5110 non-null int64
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
None
Number of duplicated rows: 0
The above results demonstrate that there are no duplicated observations, but the 'gender' and 'bmi' attributes have abnormal data: an unknown category and missing values, respectively. We will address these issues in the data cleaning section.
df.describe()
| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|---|
| count | 5110.000000 | 5110.000000 | 5110.000000 | 5110.000000 | 5110.000000 | 4909.000000 | 5110.000000 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.283560 | 7.854067 | 0.215320 |
| min | 67.000000 | 0.080000 | 0.000000 | 0.000000 | 55.120000 | 10.300000 | 0.000000 |
| 25% | 17741.250000 | 25.000000 | 0.000000 | 0.000000 | 77.245000 | 23.500000 | 0.000000 |
| 50% | 36932.000000 | 45.000000 | 0.000000 | 0.000000 | 91.885000 | 28.100000 | 0.000000 |
| 75% | 54682.000000 | 61.000000 | 0.000000 | 0.000000 | 114.090000 | 33.100000 | 0.000000 |
| max | 72940.000000 | 82.000000 | 1.000000 | 1.000000 | 271.740000 | 97.600000 | 1.000000 |
The above table provides summary statistics for the numerical attributes. Only those for 'age', 'avg_glucose_level' and 'bmi' are meaningful because they are continuous variables. Summary statistics for numeric categorical variables are less meaningful, and those for 'id' are not useful and can be ignored.
Column named 'id' is not useful for predicting the risk of stroke. Thus, we will remove it from the dataframe.
# let's clean the dataset a little before moving on
# 1. Remove attributes that just aren't useful for us
for col in ['id']:
if col in df:
del df[col]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB
As we can see, "bmi", "age" and "avg_glucose_level" are floats, which is an appropriate data type for these values. "hypertension", "heart_disease", and "stroke" are categorical attributes stored as integers. They could be stored as booleans, but the dataset is small, so we leave them as integers. The remaining attributes, "ever_married", "work_type" and "Residence_type", are categorical as well, but they are stored as the object data type (strings). We will convert some of them to numerical values with the integer data type so they can be used in calculations such as correlation and imputation.
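The string-to-integer conversions used in the following sections follow a simple pattern; here is a minimal sketch on a toy frame (the column names mirror the dataset, but the values are made up):

```python
import pandas as pd

# Toy frame (column names mirror the dataset; values are made up)
toy = pd.DataFrame({'ever_married': ['Yes', 'No', 'Yes'],
                    'Residence_type': ['Urban', 'Rural', 'Urban']})

# Explicit mapping, as done for 'gender' and 'ever_married' below
toy['ever_married'] = toy['ever_married'].map({'Yes': 1, 'No': 0})

# Alternative: the pandas 'category' dtype assigns integer codes
# alphabetically ('Rural' -> 0, 'Urban' -> 1)
toy['Residence_type'] = toy['Residence_type'].astype('category').cat.codes

print(toy)
```

An explicit `map` keeps the encoding under our control (important when the numbers should carry meaning, as with 'smoking_status' later), while `cat.codes` is convenient when any consistent integer labeling will do.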
Next, we will look at each attribute in detail.
df['gender'].value_counts(ascending = True, dropna=False)
Other        1
Male      2115
Female    2994
Name: gender, dtype: int64
df.shape
(5110, 11)
# Unique values and counts for all columns
df[df['gender'] == 'Other']
| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3116 | Other | 26.0 | 0 | 0 | No | Private | Rural | 143.33 | 22.4 | formerly smoked | 0 |
df.drop(df[df['gender'] == 'Other'].index, inplace = True)
The gender attribute has the values Male, Female and Other. Since there is only one row with gender unspecified, it is safe to remove that row from the dataset without affecting overall data integrity.
# confirm the row with gender 'Other' was removed
df.shape
df['gender'].value_counts(dropna=False)
Female 2994 Male 2115 Name: gender, dtype: int64
The row with gender 'Other' is now removed. Next, we will encode the 'gender' attribute as Male = 1, Female = 0.
# df_strCateg is a dataframe keeping string categories for categorical variables
# df will eventually be numerical for most categorical variables
import copy
df_strCateg = copy.deepcopy(df)
# encode the categorical variables from numerical to strings
df_strCateg['hypertension'] = df_strCateg['hypertension'].map({0:'non-hyperten', 1:'hyperten'})
df_strCateg['heart_disease'] = df_strCateg['heart_disease'].map({0:'non-heart-disease', 1:'heart-disease'})
df_strCateg
| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | non-hyperten | heart-disease | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | Female | 61.0 | non-hyperten | non-heart-disease | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | Male | 80.0 | non-hyperten | heart-disease | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | Female | 49.0 | non-hyperten | non-heart-disease | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | Female | 79.0 | hyperten | non-heart-disease | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | Female | 80.0 | hyperten | non-heart-disease | Yes | Private | Urban | 83.75 | NaN | never smoked | 0 |
| 5106 | Female | 81.0 | non-hyperten | non-heart-disease | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | 0 |
| 5107 | Female | 35.0 | non-hyperten | non-heart-disease | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| 5108 | Male | 51.0 | non-hyperten | non-heart-disease | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
| 5109 | Female | 44.0 | non-hyperten | non-heart-disease | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |
5109 rows × 11 columns
gender_code = {"Male":1, "Female":0}
df['gender'] = df['gender'].map(gender_code)
df['gender'].value_counts(ascending = True, dropna= False)
1    2115
0    2994
Name: gender, dtype: int64
df['age'].value_counts(ascending = True, dropna=False)
0.40 2
0.08 2
0.16 3
0.48 3
1.40 3
...
51.00 86
54.00 87
52.00 90
57.00 95
78.00 102
Name: age, Length: 104, dtype: int64
Age is an interesting attribute. People typically report age as integers, except for babies. From the above output, we see that the sample has 104 distinct age values. The second column lists the count for each unique age value.
df['hypertension'].value_counts(ascending = True, dropna=False)
1     498
0    4611
Name: hypertension, dtype: int64
The 'hypertension' attribute has two distinct values: 1 (has hypertension) and 0 (does not have hypertension).
df['heart_disease'].value_counts(ascending = True, dropna=False)
1     276
0    4833
Name: heart_disease, dtype: int64
The 'heart_disease' attribute has two distinct values: 1 (has heart disease) and 0 (does not have heart disease).
df['ever_married'].value_counts(ascending = True, dropna=False)
No     1756
Yes    3353
Name: ever_married, dtype: int64
The 'ever_married' attribute has two distinct values: Yes and No. We will encode the 'ever_married' attribute with Yes = 1, No = 0.
marry_code = {"Yes":1, "No":0}
df['ever_married'] = df['ever_married'].map(marry_code)
df['ever_married'].value_counts(ascending = True, dropna=False)
0    1756
1    3353
Name: ever_married, dtype: int64
df['smoking_status'].value_counts(ascending = True, dropna=False)
smokes              789
formerly smoked     884
Unknown            1544
never smoked       1892
Name: smoking_status, dtype: int64
smoke_code = {"smokes":1, "formerly smoked":0.75,"Unknown": 0.25, "never smoked":0}
df['smoking_status'] = df['smoking_status'].map(smoke_code)
df['smoking_status'].value_counts(ascending = True, dropna=False)
1.00     789
0.75     884
0.25    1544
0.00    1892
Name: smoking_status, dtype: int64
The 'smoking_status' attribute has four distinct categories, and we encode each with a number: the values 1, 0.75, 0.25, and 0 correspond to 'smokes', 'formerly smoked', 'Unknown', and 'never smoked', respectively.
df['work_type'].value_counts(ascending = True, dropna=False)
Never_worked       22
Govt_job          657
children          687
Self-employed     819
Private          2924
Name: work_type, dtype: int64
We think work type is not important for predicting stroke risk; thus, we do not transform its categories into numerical values.
df['stroke'].value_counts(ascending = True, dropna=False)
1     249
0    4860
Name: stroke, dtype: int64
The stroke attribute has integer values 1 and 0. This attribute looks good.
from sklearn.impute import KNNImputer
knn_obj = KNNImputer(n_neighbors = 6)
features_to_use = ['gender','age','hypertension','heart_disease',
'avg_glucose_level', 'bmi', 'smoking_status']
temp = df[features_to_use].to_numpy()
knn_obj.fit(temp)
temp_imputed = knn_obj.transform(temp)
df_imputed = copy.deepcopy(df)
df_imputed[features_to_use] = temp_imputed
df_imputed.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5109 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5109 non-null   float64
 1   age                5109 non-null   float64
 2   hypertension       5109 non-null   float64
 3   heart_disease      5109 non-null   float64
 4   ever_married       5109 non-null   int64  
 5   work_type          5109 non-null   object 
 6   Residence_type     5109 non-null   object 
 7   avg_glucose_level  5109 non-null   float64
 8   bmi                5109 non-null   float64
 9   smoking_status     5109 non-null   float64
 10  stroke             5109 non-null   int64  
dtypes: float64(7), int64(2), object(2)
memory usage: 479.0+ KB
df_strCateg['bmi'] = df_imputed['bmi']
Now let's see if the imputation method changed the overall histogram drastically.
df_imputed.bmi.plot(kind='hist',
alpha=0.25,
label="KNN-Imputer",
bins=50
)
df.bmi.plot(kind='hist', alpha=0.25,
label="Original",
bins=50
)
plt.legend()
plt.show()
It looks like KNN imputation works quite well for the 'bmi' attribute because the distribution of 'bmi' after imputation is quite similar to the distribution before imputation. We will use the imputed results for further analysis.
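Besides the visual check, one can compare summary statistics before and after imputation. The sketch below uses synthetic data, with simple median imputation standing in for the KNN imputer, just to show the pattern:

```python
import numpy as np
import pandas as pd

# Synthetic 'bmi'-like column (NOT the stroke dataset), with ~4% of
# values knocked out, mirroring the missingness proportion in the lab.
rng = np.random.default_rng(0)
bmi_full = pd.Series(rng.normal(29, 8, size=1000)).clip(lower=10)
bmi_missing = bmi_full.copy()
bmi_missing.iloc[:40] = np.nan

# Median imputation stands in here for the KNN imputer used above.
bmi_imputed = bmi_missing.fillna(bmi_missing.median())

# A small shift in the mean suggests imputation did not distort the
# distribution; a large shift would warrant a different strategy.
mean_shift = abs(bmi_imputed.mean() - bmi_full.mean())
print(f"mean shift after imputation: {mean_shift:.3f}")
```

The same comparison (e.g. `df.bmi.describe()` versus `df_imputed.bmi.describe()`) complements the histogram overlay with concrete numbers.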
# let's break up the age variable
df_imputed['age_range'] = pd.cut(df_imputed['age'],[0,12,18,29,60,100],
labels=['child','teen','young adult','adult','senior']) # this creates a new variable
df_imputed.age_range.describe()
count      5109
unique        5
top       adult
freq       2291
Name: age_range, dtype: object
df_strCateg['age_range'] = df_imputed['age_range']
df_grouped = df_strCateg.groupby(by=['hypertension','age_range'])
print ("Percentage of people having a stroke in each hypertension status and age group combination:")
print (df_grouped.stroke.sum() / df_grouped.stroke.count() *100)
Percentage of people having a stroke in each hypertension status and age group combination:
hypertension age_range
hyperten child NaN
teen 0.000000
young adult 0.000000
adult 6.829268
senior 18.439716
non-hyperten child 0.170068
teen 0.305810
young adult 0.000000
adult 2.684564
senior 12.230920
Name: stroke, dtype: float64
df_grouped = df_strCateg.groupby(by=['age_range','hypertension'])
stroke_risk = df_grouped.stroke.sum() / df_grouped.stroke.count()
ax = stroke_risk.plot(kind='barh')
plt.title('Risk of stroke by subclass combinations')
plt.show()
There are interesting discoveries. We see that no children in the sample had hypertension. For children and teens who did not have hypertension, around 0.170% and 0.306% had a stroke, respectively. For teens and young adults who had hypertension, none had strokes. For young adults who did not have hypertension, none had strokes. For adults and senior adults who did not have hypertension, around 2.684%, and 12.231% had a stroke, respectively. For adults and senior adults who had hypertension, around 6.829% and 18.440% had a stroke, respectively.
The results indicate hypertension is a risk factor for strokes because, in the adult and senior adult group, the chance of having a stroke is much higher if one has hypertension.
It is curious that some children and teens without hypertension had a stroke. A reasonable explanation is that strokes in young children and teens are more likely to be caused by genetic or congenital disorders. If one does not have genetic or congenital risk factors for stroke, he/she has a low risk of stroke in childhood and young adulthood.
perct_stroke = sum(df.stroke == 1)/len(df)
print(f"The percentage of people who had stroke is {perct_stroke: .2%} in the sample.")
The percentage of people who had stroke is 4.87% in the sample.
df_hypertension_grouped = df.groupby(by = 'hypertension')
for val, grp in df_hypertension_grouped:
    print('There were', len(grp), 'people in hypertension class', str(val) + '.')
print(f"""For the hypertension attribute, 0 indicates a person not having
hypertension and 1 indicates a person having hypertension.
{498/5109:.2%} of people in the sample have hypertension.""")
There were 4611 people in hypertension class 0.
There were 498 people in hypertension class 1.
For the hypertension attribute, 0 indicates a person not having
hypertension and 1 indicates a person having hypertension.
9.75% of people in the sample have hypertension.
print("""Total number of people having stroke in
non-hypertension group (0) and hypertension group (1):""")
print(df_hypertension_grouped["stroke"].sum())
print('---------------------------------------------------')
print("""The percentage of people having stroke in
non-hypertension group (0) and hypertension group (1):""")
print(df_hypertension_grouped["stroke"].sum() / df_hypertension_grouped.stroke.count())
Total number of people having stroke in non-hypertension group (0) and hypertension group (1): hypertension 0 183 1 66 Name: stroke, dtype: int64 --------------------------------------------------- The percentage of people having stroke in non-hypertension group (0) and hypertension group (1): hypertension 0 0.039688 1 0.132530 Name: stroke, dtype: float64
print(f""" We see that the percentage of people having stroke
in the hypertension group (1) is {0.132530/0.039688:.2f} times that in
non-hypertension group (0). This indicates correlation
and not causation.
""" )
We see that the percentage of people having stroke in the hypertension group (1) is 3.34 times that in non-hypertension group (0). This indicates correlation and not causation.
import seaborn as sns
sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(12, 12))
# note: sns.set() returns None, so passing its result as cmap had no effect;
# the default colormap is used
sns.heatmap(df_imputed.corr(), annot=True)
# f.tight_layout()
<AxesSubplot:>
From the correlation heatmap, we see that marital status and age have a correlation of 0.68, which is relatively high compared with the rest. The second highest correlation is between marital status and BMI, and the third highest is between age and BMI. These findings make sense: people tend to get married as they get older, and BMI tends to increase with age (see the plot below for more evidence). More importantly, the correlations between stroke and age, hypertension, heart disease, marital status, and blood glucose are moderate, while the correlations of stroke with BMI and smoking status are very weak.
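The top correlations read off the heatmap can also be extracted programmatically by ranking the upper triangle of the correlation matrix. A sketch on a synthetic frame (the same pattern applies to `df_imputed.corr()`):

```python
import numpy as np
import pandas as pd

# Synthetic frame with correlations built in: marriage depends on age,
# and bmi drifts upward with age. (Illustrative values only.)
rng = np.random.default_rng(1)
age = rng.uniform(0, 80, 500)
frame = pd.DataFrame({
    'age': age,
    'ever_married': (age > 25).astype(int),
    'bmi': 20 + 0.1 * age + rng.normal(0, 3, 500),
})

corr = frame.corr().abs()
# Keep only the upper triangle so each pair appears once, then rank.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.unstack().dropna().sort_values(ascending=False)
print(pairs)
```

Ranking pairs this way avoids reading annotations off a large heatmap and scales to any number of attributes.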
ax = df_imputed.boxplot(column='bmi', by = 'age_range') # group by class
plt.ylabel('bmi')
ax.set_yscale('log')
plt.title('Box plot for BMI by age groups')
Text(0.5, 1.0, 'Box plot for BMI by age groups')
Boxplots of BMI for each age group are shown above. Each age group has more high-BMI outliers than low-BMI outliers. Median BMI tends to increase with age up to about age 60, and we can observe that it decreases slightly for adults over 60.
# Start by just plotting what we previously grouped!
plt.style.use('ggplot')
fig = plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
df_imputed.age.plot.hist(bins=30)
plt.subplot(1,3,2)
df_imputed.age.plot.kde(bw_method=0.3)
plt.subplot(1,3,3)
df_imputed.age.plot.hist(bins=50)
df_imputed.age.plot.kde(bw_method=0.08, secondary_y=True)
plt.show()
The above plots demonstrate that subjects in the sample have an age range from 0+ to 80+. Moreover, the kernel density estimation (KDE) plot in the middle shows that patients between ages 45 and 55 have the highest count in the dataset, and the age distribution is non-normal.
plt.subplots(figsize=(20, 10))
sns.violinplot(x="stroke", y="age", hue="gender", data=df_strCateg,
split=True,
inner="quart",
scale="count")
plt.show()
The split violin plots convey plenty of useful information. In each violin, the red half indicates males and the blue half females. The distributions of patients without and with stroke history are shown on the left and right, respectively. The plot shows that there are more females than males in the dataset, that the age distributions of the two genders are quite similar within each stroke group, and that the mean age of patients with stroke history is much higher for both genders.
#t-test
from scipy import stats
# two sample t test for equal mean of age for stroke vs. non-stroke group
age_stroke = (df_strCateg[(df_strCateg['stroke'] == 1)])['age']
age_nonstroke = (df_strCateg[(df_strCateg['stroke'] == 0)])['age']
stats.ttest_ind(age_stroke,age_nonstroke, equal_var = False)
Ttest_indResult(statistic=29.681861373766573, pvalue=2.175773269747784e-95)
Treating people with stroke as sample 1 and people without stroke as sample 2, we test if mean ages are the same for both samples by performing a two independent sample t-test and assuming unequal variance.
The t-test has test statistic 29.68 and a p-value close to 0. Thus, at significance level 0.05, we reject the null hypothesis and conclude that the mean ages of people with and without stroke are not equal.
plt.subplots(figsize=(20, 10))
sns.violinplot(x="stroke", y="bmi", hue="gender", data=df_strCateg,
split=True,
inner="quart",
scale="count")
plt.show()
In each violin, the red half indicates males and the blue half females. The distributions of patients without and with stroke history are shown on the left and right, respectively. There are more females than males in the dataset, and the BMI distributions of the two genders are quite similar within each stroke group.
#t-test
bmi_stroke = (df_strCateg[(df_strCateg['stroke'] == 1)])['bmi']
bmi_nonstroke = (df_strCateg[(df_strCateg['stroke'] == 0)])['bmi']
stats.ttest_ind(bmi_stroke,bmi_nonstroke, equal_var = False)
Ttest_indResult(statistic=3.7113802764188804, pvalue=0.0002465702727296113)
Treating people with stroke as sample 1 and people without stroke as sample 2, we test if mean BMIs are the same for both samples by performing a two independent sample t-test and assuming unequal variance.
The t-test has test statistic 3.71 and a p-value 0.00024. Thus, using significance level 0.05, the mean BMIs of people with and without stroke are not equal.
Notice that even though the t-test shows a statistically significant difference, it may not be practically significant. The observed difference in mean BMI is relatively small, as the split violin plot above shows.
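One way to quantify practical significance is an effect size such as Cohen's d, the standardized mean difference. The sketch below uses synthetic BMI samples with hypothetical group sizes and parameters; with the lab's data one would pass `bmi_stroke` and `bmi_nonstroke` instead:

```python
import numpy as np

# Synthetic BMI samples (hypothetical group sizes and parameters);
# with the lab's data, pass bmi_stroke and bmi_nonstroke instead.
rng = np.random.default_rng(2)
group_stroke = rng.normal(30.2, 6.4, 209)
group_nonstroke = rng.normal(28.8, 7.9, 4700)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(group_stroke, group_nonstroke)
# Conventionally, |d| < 0.2 is a 'small' effect: a tiny p-value can
# coexist with a difference that matters little in practice.
print(f"Cohen's d = {d:.2f}")
```

A large sample size makes even small mean differences statistically significant, which is why reporting an effect size alongside the p-value is good practice.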
# the cross tabulation operator provides an easy way to get these numbers
stroke_risk_stackplot = pd.crosstab([df_strCateg['heart_disease'],
df_imputed['age_range']],
df_imputed.stroke.astype(bool)) # how to group
print(stroke_risk_stackplot)
stroke_risk_stackplot.plot(kind='bar', stacked=True)
plt.show()
stroke False True
heart_disease age_range
heart-disease child 1 0
young adult 1 0
adult 59 5
senior 168 42
non-heart-disease child 586 1
teen 327 1
young adult 597 0
adult 2162 65
senior 959 135
The table and stacked bar chart above show that heart disease was very rare before adulthood: only one child and one young adult in the sample had heart disease, no teens did, and none of these young subjects with heart disease had a stroke. Adults and senior adults, with or without heart disease, all have cases of stroke, and senior adults appear to have a higher chance of having a stroke.
UMAP is a technique for linear or general non-linear dimension reduction and visualization. The algorithm makes three critical assumptions about the data: 1) The data is uniformly distributed on a Riemannian manifold. 2) The Riemannian metric is approximately locally constant. 3) The manifold is locally connected. UMAP models the manifold by searching for an equivalent low-dimensional fuzzy topological structure of the data [5].
Like t-distributed stochastic neighbor embedding (t-SNE), "UMAP constructs a high dimensional graph representation of the data and then optimizes a low-dimensional graph to be as structurally similar as possible." [6] First, UMAP measures the probability that two points are connected in high dimensions by calculating the distance between them. It then determines each point's closest neighbors based on the results of the previous step. Next, UMAP measures the probability that two points are connected in low dimensions. In the last step, using gradient descent, UMAP minimizes the cross-entropy between the two representations to preserve the data structure as much as possible [7].
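The cross-entropy objective mentioned above can be written as follows (a hedged reconstruction following the UMAP paper's notation, where $w_h(e)$ and $w_l(e)$ denote the edge weights of the high- and low-dimensional fuzzy graphs over the edge set $E$):

```latex
C = \sum_{e \in E} \left[ w_h(e)\,\log\frac{w_h(e)}{w_l(e)}
    + \bigl(1 - w_h(e)\bigr)\,\log\frac{1 - w_h(e)}{1 - w_l(e)} \right]
```

Gradient descent on $C$ pulls strongly connected points together (first term) while pushing weakly connected points apart (second term).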
There are two essential parameters in UMAP: 1) Number of nearest neighbors: this specifies how many neighboring points are considered when calculating the connection probabilities. A higher number preserves the global structure, but the local structure may become less meaningful. 2) min_dist: this determines how tightly points can be packed together in the low-dimensional embedding. A higher value preserves the broad structure, but two points that are close in high dimensions may not end up near each other.
Since UMAP is a stochastic approach, even with the same values for the number of nearest neighbors and min-distance, the outcome may differ in each run [7].
import copy
from sklearn.preprocessing import MinMaxScaler

# Scale the numerical columns to the [0, 1] range.
normalized_columns = ['age', 'avg_glucose_level', 'bmi']
normalized_values = df_imputed[normalized_columns].values
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(normalized_values)
values_to_normalized = scaler.transform(normalized_values)
# Keep the original dataframe intact and work on a deep copy.
df_normalized = copy.deepcopy(df_imputed)
df_normalized[normalized_columns] = values_to_normalized
df_normalized.head()
| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | age_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.816895 | 0.0 | 1.0 | 1 | Private | Urban | 0.801265 | 0.301260 | 0.75 | 1 | senior |
| 1 | 0.0 | 0.743652 | 0.0 | 0.0 | 1 | Self-employed | Rural | 0.679023 | 0.265178 | 0.00 | 1 | senior |
| 2 | 1.0 | 0.975586 | 0.0 | 1.0 | 1 | Private | Rural | 0.234512 | 0.254296 | 0.00 | 1 | senior |
| 3 | 0.0 | 0.597168 | 0.0 | 0.0 | 1 | Private | Urban | 0.536008 | 0.276060 | 1.00 | 1 | adult |
| 4 | 0.0 | 0.963379 | 1.0 | 0.0 | 1 | Self-employed | Rural | 0.549349 | 0.156930 | 0.00 | 1 | senior |
In our dataset, the "age", "avg_glucose_level" and "bmi" attributes are numerical but measured on different scales. Without normalization, UMAP's distance calculations would be dominated by the attribute with the largest range. Therefore, scikit-learn's MinMaxScaler is used to normalize the values to the range between 0 and 1.
#Ref: https://umap-learn.readthedocs.io/en/latest/basic_usage.html#penguin-data
import umap
sns.set(style='white', context='notebook', rc={'figure.figsize': (24, 10)})
columns_to_use = ['age', 'avg_glucose_level', 'bmi']
# Supervised UMAP: the 'stroke' labels are passed as the target (y) so that
# target_metric='categorical' takes effect; the label is not used as a feature.
fit = umap.UMAP(n_components=2, metric='correlation',
                target_metric='categorical',
                n_neighbors=60,
                min_dist=0.2)
umap_data = fit.fit_transform(df_normalized[columns_to_use],
                              y=df_normalized['stroke'])
#https://umap-learn.readthedocs.io/en/latest/basic_usage.html#penguin-data - Show UMAP result
plt.scatter(umap_data[:, 0], umap_data[:, 1],
c=[sns.color_palette()[x] for x in df_normalized.stroke.map({0:0, 1:3})])
plt.title("UMAP projection of the Stroke Prediction dataset", fontsize = 24);
The blue "cloud" of points represents subjects without stroke, and the red "cloud" represents subjects with stroke. The two clouds are well separated, so UMAP captures the 'stroke' class very well. With this technique, we reduced the dimensions to two variables.
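"Well separated" can also be quantified rather than judged by eye, for example with the silhouette score on the embedded coordinates. The sketch below uses two synthetic 2-D clouds in place of the actual umap_data produced above, so the cluster locations, sizes, and spreads are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two synthetic, well-separated 2-D clouds standing in for the UMAP output.
cloud_no_stroke = rng.normal(loc=(0, 0), scale=0.5, size=(200, 2))
cloud_stroke = rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))
embedding = np.vstack([cloud_no_stroke, cloud_stroke])
labels = np.array([0] * 200 + [1] * 50)

# Silhouette is in [-1, 1]; values near 1 indicate well-separated groups.
score = silhouette_score(embedding, labels)
print(round(score, 3))
```

Applied to the real embedding (silhouette_score(umap_data, df_normalized['stroke'])), a high score would support the visual impression of two distinct clouds.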
import plotly
from plotly.graph_objs import Scatter, Layout

# Human-readable hover labels for each point.
df_stroke_categories = df_imputed['stroke'].map({1: 'Had a stroke', 0: 'No stroke history'}).to_numpy()
df_stroke = pd.DataFrame(umap_data, columns=('x', 'y'))
plotly.offline.init_notebook_mode()
# Blue for subjects without stroke, red for subjects with stroke.
colors = ['blue' if stroke_val == 0 else 'red'
          for stroke_val in df_imputed['stroke'].to_numpy()]
plotly.offline.iplot({
    'data': [
        Scatter(x=df_stroke.x,
                y=df_stroke.y,
                marker=dict(color=colors),
                text=df_stroke_categories,
                mode='markers'),
    ],
    'layout': Layout(title='UMAP Stroke categories',
                     height=450,
                     width=800)
})
The interactive UMAP plot above conveys more information than the previous static plot: hovering the cursor over any point shows that point's embedded coordinates and whether the corresponding subject had a stroke.
Data analysis for the stroke prediction dataset was performed via data cleaning, visualization, and dimension reduction techniques. As one grows older, the risk of stroke increases, and having hypertension or heart disease is also likely to increase that risk. Confounding factors such as genetic disorders, eating and sleeping habits, and stress levels might also affect stroke risk but are not captured by the dataset.
Uniform Manifold Approximation and Projection (UMAP) captures the stroke and non-stroke group characteristics adequately. For further study, other dimension reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) can be applied to compare the performance of the reduction techniques.
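As a starting point for such a comparison, PCA is available directly in scikit-learn. This sketch reduces made-up standardized features to two components; the data is random and merely stands in for the dataset's normalized 'age', 'avg_glucose_level', and 'bmi' columns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Random stand-ins for the three normalized numerical columns.
X = rng.random(size=(500, 3))

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                           # (500, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

Unlike UMAP, PCA is a linear and deterministic method, so its explained-variance ratio gives a direct, reproducible measure of how much information the 2-D projection retains.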
[1] https://www.cdc.gov/stroke/facts.htm
[2] https://www.mayoclinic.org/diseases-conditions/stroke/symptoms-causes/syc-20350113
[3] https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
[5] https://umap-learn.readthedocs.io/en/latest/
[6] https://pair-code.github.io/understanding-umap/
[7] https://www.youtube.com/watch?v=VPq4Ktf2zJ4&t=313s&ab_channel=AdrienLeroy